Bayesian Distance Clustering

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging on properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.


Introduction
Clustering is a primary focus of many statistical analyses, providing a valuable tool for exploratory data analysis and simplification of complex data. In the literature, there are two primary approaches -distance-and model-based clustering. Let y i ∈ Y, for i = 1, . . . , n, denote the data and let d(y, y ) denote a distance between data points y and y . Then, distance-based clustering algorithms are typically applied to the n × n matrix of pairwise distances D (n)×(n) = {d i,j }, with d i,j = d(y i , y j ) for all i, j pairs and (n) = {1, . . . , n}. For reviews, see Jain (2010); Xu and Tian (2015). In contrast, model-based clustering takes a likelihood-based approach in building a model for the original data y (n) that has the form: parametric family, such as the Gaussian, with θ denoting the parameters. The finite mixture model (1) can be obtained by marginalizing out the cluster index c i ∈ {1, . . . , k} in the following model: Using this data-augmented form, one can obtain maximum likelihood estimates of the model parameters π and θ = {θ h } via an expectation-maximization algorithm (Fraley and Raftery, 2002). Alternatively, Bayesian methods are widely used to include prior information and characterize uncertainty in the parameters. For reviews, see Bouveyron and Brunet-Saumard (2014) and . Distance-based algorithms tend to have the advantage of being relatively simple conceptually and computationally, while a key concern is the lack of characterization of uncertainty in clustering estimates and associated inferences. While model-based methods can address these concerns by exploiting a likelihood-based framework, a key disadvantage is a large sensitivity to the choice of kernel K(·; θ). Often, kernels are chosen for simplicity and computational convenience, and they place rigid assumptions on the shape of the clusters, which are not justified by the applied setting being considered.
We are not the first to recognize this problem, and there is literature attempting to address issues with kernel robustness in model-based clustering. One direction is to choose a flexible class of kernels, which can characterize a wide variety of densities. For example, one can replace the Gaussian kernel with one that accommodates asymmetry, skewness and/or heavier tails (Karlis and Santourian (2009);Juárez and Steel (2010); O' Hagan et al. (2016); Gallaugher and McNicholas (2018); among others). A related direction is to nonparametrically estimate the kernels specific to each cluster, while placing minimal constraints for identifiability, such as unimodality and sufficiently light tails. This direction is related to the mode-based clustering algorithms of Li et al. (2007); see also Rodríguez and Walker (2014) for a Bayesian approach using unimodal kernels. Unfortunately, as discussed by Hennig et al. (2015), a kernel that is too flexible leads to ambiguity in defining a cluster and identifiability issues: for example, one cluster can be the union of several clusters that are close. Practically, such flexible kernels demand a large number of parameters, leading to daunting computation cost.
A promising new strategy is to replace the likelihood with a robust alternative. Coretto and Hennig (2016) propose a pseudo-likelihood based approach for robust multivariate clustering, which captures outliers with an extra improper uniform component. Miller and Dunson (2018) propose a coarsened Bayes approach for robustifying Bayesian inference and apply it to clustering problems. Instead of assuming that the observed data are exactly generated from (1) in defining a Bayesian approach, they condition on the event that the empirical probability mass function of the observed data is within some small neighborhood of that for the assumed model. Both of these methods aim to allow small deviations from a simple kernel. It is difficult to extend these approaches to data with high complexity, such as clustering multiple time series, images, etc.
We propose a new approach based on a Bayesian model for the pairwise distances, avoiding a complete specification of the likelihood function for the data y (n) . There is a rich literature proposing Bayesian approaches that replace an exact likelihood function with some alternative. Chernozhukov and Hong (2003) consider a broad class of such quasi-posterior distributions. Jeffreys (1961) proposed a substitution likelihood for quantiles for use in Bayesian inference; also refer to Dunson and Taylor (2005). Hoff (2007) proposed a Bayesian approach to inference in copula models, which avoids specifying models for the marginal distributions via an extended rank likelihood. Johnson (2005) proposed Bayesian tests based on modeling frequentist test statistics instead of the data directly. These are just some of many examples.
Our proposed Bayesian distance clustering approach gains some of the advantages of modelbased clustering, such as uncertainty quantification and flexibility, while significantly simplifying the model specification task. There is a connection between our approach and nonnegative matrix factorization (NMF) methods (Kuang et al., 2012;Zhao et al., 2015;Kuang et al., 2015). Certain NMF algorithms can be viewed as fast approximations to our likelihood-based approach. Our major contributions are: (i) establishing a novel link between model-and distance-based frameworks, (ii) introducing a principled choice for assigning kernels for distances (equivalent to the affinity/similarity score in NMFs), and (iii) providing a way to calibrate the parameters within the proposed probabilistic framework.

Partial likelihood for distances 2.1 Motivation for partial likelihood
Suppose that data y (n) are generated from model (1) or equivalently (2). We focus on the case in which y i = (y i,1 , . . . , y i,p ) ∈ Y ⊂ R p . The conditional likelihood of the data y (n) given clustering indices c (n) can be expressed as n h ,1 ): 1 given the differences. Expression (4) is a product of the densities of the seed and (n h − 1) differences. As the cluster size n h increases, the relative contribution of the seed density K h (y 1 | .). We now use a toy example to illustrate how to derive the function G h from a known modelbased likelihood (later we will show how to specify G h directly, when the model-based likelihood is not known). Consider the case of y i ∈ R from a Gaussian mixture, starting from the likelihood of those y i 's associated with c i = h: 1 , and integrate out y n h ,1 ]: i,1 ) 2 /n h ]. In the above example, note that the right hand side appears to be a product density of all the pairwise differencesD [h] = {d [h] i,j } (i,j) , besides those formed with the seed. This is due to the linear equality thatd j,1 -therefore, there are effectively only (n h −1) free random variables; once they are given, the rest are completely determined.
Generalizing from this form, we now specify G as where g h : R p → R + and eachd [h] i,j is assigned a marginal density. The power 1/n h is a calibration parameter that adjusts the order discrepancy between the numbers of (n h − 1)n h /2 marginal densities and (n h − 1) effective random variables. We will formally justify this calibration in the theory section.
We now state the assumptions that we use for clustering.
Assumption 1 For those data within a cluster, y [h] i and y [h] j are independent and identically distributed.
Focusing on marginally specifying each g h , we can immediately obtain two key properties of d j : (1) Expectation zero, and (2) Marginal symmetry with skewness zero. Hence, the distribution of the differences is substantially simpler than the original data distribution K h . This suggests using G h for clustering will substantially reduce the model complexity and improve robustness.
We connect the density of the differences to a likelihood of 'distances' -here used as a loose notion including metrics, semi-metrics and divergences. Consider d i,j ∈ [0, ∞) as a transform of d i,j , such as some norm d i,j = d i,j (e.g. Euclidean or 1-norm); hence, a likelihood for d i,j is implicitly associated to a pushforward measure from the one ond i,j (assuming a measurable transform). For example, an exponential density on d i,j = d i,j 1 can be taken as the result of assigning a multivariate Laplace ond i,j . We can further generalize the notion of difference from subtraction to other types, such as ratio, cross-entropy, or an application-driven specification (Izakian et al., 2015).
To summarize, this motivates the practice of first calculating a matrix of pairwise distances, and then assigning a partial likelihood for clustering. For generality, we slightly abuse notation and replace the difference arrayD with the distance matrix D in (5), and denote the density by G h (D [h] ). We will refer to (5) as the distance likelihood from now on. Conditional on the clustering labels, with c i ∼ k h=1 π h δ h independently, as is (2). To provide some intuition about the clustering uncertainty, we simulate two clusters using the bivariate skewed Gaussian distribution: in each dimension, the first cluster has a skewness of 4 (red in Figure 1, left panel), location of 0 and scale of 1, and the second has a skewness of −2, location of 0.5 and scale of 1 (grey in Figure 1, left panel); the two sub-coordinates are generated independently. Via the skew Gaussian density K h (y i ) used to generate the data, we can compute the oracle assignment probability pr(c i = h | y i ) for h = 1 and 2, and the most likely cluster assignmentĉ i for each data point.
Clearly, due to the overlapping of the two clusters, there is a significant amount of uncertainty for those near the location (0, 0), as the pr(c i =ĉ i ) remains away from 0 ( Figure 1, center panel) -importantly, such uncertainty does not vanish even as n → ∞, as these points will remain nearly equidistant to the two cluster centers. Using the distance likelihood on D, we can obtain an easy quantification of the uncertainty, by sampling c i from the posterior distribution and calculating the co-assignment probabilities pr(c i = c j | D); as shown in the right panel, the off-diagonal block shows that a significant portion of data that can be co-assigned to either the first or the second cluster with a non-trivial probability.

Choosing a distance density for clustering
To implement our Bayesian distance clustering approach, we need a definition of clusters, guiding us to choose a parametric form for g h (.) in (5). For conciseness, we will focus on the norm-based distances from now on. A popular intuition for a cluster is a group of data points, such that most of the distances among them are relatively small. That is, the probability of finding large distances within a cluster should be low. We now state the assumption.
Assumption 2 With σ h > 0, a scale parameter and h a function that rapidly declines towards 0 as t increases.
For such a decline, it is common to consider the sub-exponential rate (Wainwright, 2019), h (t) = O{exp(−t/b)} with some constant b > 0. Ideally, we want σ h to be small, so that most of the distances within a cluster are small. It is tempting to consider using a simple exponential density Exp(σ h ) for modeling d [h] i,j , however, we make an important observation here: the exponential distribution has a deterministic relationship between the mean σ h and the variance σ 2 h -this means any slightly large Ed [h] i,j (such as when the distribution of d [h] i,j does not follow a exponential decay near zero) will inflate the estimate of σ h , making it difficult to use small distances for clustering.
Left is the first two dimensions, and the right show that the distances formed within a cluster (cyan) tend to be much smaller than the ones across clusters (red). Each cluster's data are generated from a multivariate Laplace distribution This motivates us to use a two-parameter distribution instead -in this article, we use Gamma (α h , σ h ) for g h in (5), whose variance α h σ 2 h is no longer completely determined by the mean α h σ h .
We defer the prior choice for α h and σ h to a later section. For now, we verify that the Gamma distribution does have a sub-exponential tail that satisfies Assumption 1.
Lemma 2 (Bound on the right tail) If d has the density (8), for any α h ≥ 1 and t > 0, Remark 3 The polynomial term t α h allows deviation from the exponential distribution at small t; its effect vanishes as t increases.
The assumption on the distances are connected to some implicit assumptions on the data distribution K(y i ). As such a link varies with the specific form of the distance, we again focus on the vector norm of subtraction d We now show that a sub-exponential tail for the vector norm distance is a necessary result of assuming sub-exponential tails in K(y i ).

Lemma 4 (Tail of vector norm distance) If there exist bound constants m
then, there exist another two constants ν h , b h > 0, such that for any q ≥ 1 Remark 5 The concentration property (9) is less restrictive than common assumptions on the kernel in a mixture model, such as Gaussianity, log-concavity or unimodality.

Hyper-prior specification for α h and σ h
In Bayesian clustering, it is useful to choose the prior parameters in a reasonable range (Malsiner-Walli et al., 2017). Recall in our gamma density, α h determines the mean for d [h] i,j at α h σ h . To favor small values for the mode while accommodating a moderate degree of uncertainty, we use a Gamma prior α h ∼ Gamma(1.5, 1.0).
To select a prior for σ h , we associate it with a pre-specified maximum cluster number k. We can view k as a packing number -that is, how many balls (clusters) we can fit in a container of the data. To formalize, imagine a p-dimensional ellipsoid in R p enclosing all the observed data. The smallest volume of such an ellipsoid is which can be obtained via a fast convex optimization algorithm (Sun and Freund, 2004), with M = π p/2 /Γ(p/2 + 1) andπ ≈ 3.14.
If we view each cluster as a high-probability ball of points originating from a common distribution, then the diameter -the distance between the two points that are farthest apart -is ∼ 4σ h . This is calculated based on pr(d ≤ 4σ h ) ≈ 0.95 using the gamma density with shape 1.5 (the prior mean of α h ). We denote the ball by Setting k to the packing number yields a sensible prior mean for σ h . For conjugacy, we choose an inverse-gamma prior for The above prior can be used as a default in broad applications, and does not require tuning to each new application.

Theory
We describe several interesting properties for the distance likelihood.
We fill a missing gap between the model-based and distance likelihoods by considering an information-theoretic analysis of the two clustering approaches. This also leads to a principled choice of the power 1/n h in (5).
To quantify the information in clustering, we first briefly review the concept of Bregman divergence (Bregman, 1967). Letting φ : S → R be a strictly convex and differentiable function, with S the domain of φ, the Bregman divergence is defined as where φ(y) denotes the gradient of φ at y. A large family of loss functions, such as squared norm and Kullback-Leibler divergence, are special cases of the Bregman divergence with suitable φ. For model-based clustering, when the regular exponential family ('regular' as the parameter space is a non-empty open set) is used for the component kernel K h , Banerjee et al. (2005) show that there always exists a re-parameterization of the kernel using Bregman divergence. Using our notation, where T (y i ) is a transformation of y i , in the same form as the minimum sufficient statistic for θ h (except this 'statistic' is based on only one data point y i ); µ h is the expectation of T (y i ) taken with respect to K h (y; θ h ); ψ, κ and b φ are functions mapping to (0, ∞).
With this re-parameterization, maximizing the model-based likelihood over c (n) becomes equivalent to minimizing the within-cluster Bregman divergence We will refer to H y as the model-based divergence. For the distance likelihood, considering those distances that can be viewed or re-parameterized as a pairwise Bregman divergence, we assume each g(d [h] i,j ) in the distance likelihood (5) can be re-written with a calibrating power β h > 0 as with z > 0 the normalizing constant. A distance-based divergence H d can be computed as We now compare these two divergences H y and H d at their expectations.
Lemma 7 (Expected Bregman Divergence) The distance-based Bregman divergence (11) in cluster h has where the expectation over y [h] is taken with respect to K h .

Remark 8
The term inside the expectation on the right hand side is the symmetrized Bregman divergence between T (y [h] i ) and µ h (Banerjee et al., 2005). Therefore, There is an order difference of O(n h ) between distance-based and model-based divergences. Therefore, a sensible choice is simply setting β h = 1/n h . This power is related to the weights used in composite pairwise likelihood (Lindsay, 1988;Cox and Reid, 2004).

Relationship to Graph Cut
It is also interesting to consider the matrix form of the distance likelihood. We use C as an n × k binary matrix encoding the cluster assignment, with Then it can be verified that C T C = diag(n 1 , . . . , n k ). Hence the distance likelihood, with the Gamma density, is where D is the n × n distance matrix, log is applied element-wise, Σ = diag(σ 1 , . . . , σ k ), and Λ = diag(α 1 − 1, . . . , α h − 1). If C contains zero columns, the inverse is replaced by a generalized inverse. One may notice some resemblance of (12) to the loss function in graph partitioning algorithms. Indeed, if we simplify the parameters to α 1 = · · · = α k = α 0 and σ 1 = · · · = σ k = σ 0 , then where A = κ1 n,n − D/σ 0 + (α 0 − 1) log D can be considered as an adjacency matrix of a graph formed by a log-Gamma distance kernel, with 1 n,n as an n×n matrix with all elements equal to 1; κ a constant so that each A i,j > 0 (since κ enters the likelihood as a constant tr{C T κ1 n,n C(C T C) −1 } = nκ, it does not impact the likelihood of C). To compare, the popular normalized graph-cut loss (Bandeira et al., 2013) is which is the total edges deleted because of partitioning (weighted by n −1 h to prevent trivial cuts). There is an interesting link between (13) and (14).
Lemma 9 Considering a graph with weighted adjacency matrix A, the normalized graph-cut loss is related to the negative log-likelihood (omitting constant) (13) via

Remark 10
The difference on the right is often known as the degree-based regularization (with n j=1 A i,j the degree, n c i the size of the cluster that data i is assigned to). When the cluster sizes are relatively balanced, we can ignore its effect.
Such a near-equivalence suggests that we can exploit popular graph clustering algorithms, such as spectral clustering, for good initiation of C as a warm-start of the Markov chain Monte Carlo algorithm.

Posterior computation
For the posterior computation, Gibbs sampler is easy to derive as it involves updating one element of the parameter at a time, from the full conditional distribution. However, this could lead to a heavy computational cost for our model. To understand this, consider the update step for each c i , which involves a draw from the categorical distribution: , whereC h denotes a matrix equal to the current value of C, except replacing the ith row with C i,h = 1 and C i,j = 0 for other j = h. Since G(D; C) involves a matrix inverse term (C T C) −1 , the above ratio cannot be simplified to reduce the computational burden. The evaluation cost for each G(D; C) is dominated by the matrix multiplication steps within, hence having an overall cost of O(n 2 k). Therefore, iterating over h = 1, . . . , k and i = 1, . . . , n will lead to a high cost in one sweep of update.
To solve this problem, we instead develop a more efficient algorithm based on the approximate Hamiltonian Monte Carlo (HMC) algorithm. We use a continuous relaxation of each row C i (on a simplex vertex) into the interior of the simplex, and denote the relaxation by W i ∈ ∆ (k−1) \∂ . This is achieved via a tempered softmax re-parameterization (Maddison et al., 2017) At small t > 0 and close to 0, if one v i,h is slightly larger than the rest in {v i,1 , . . . , v i,k }, then w i,h will be close to 1, and all the other w i,h 's close to 0. In this article, we use t = 0.1 as a balance between the approximation accuracy and the numeric stability of the algorithm. In addition, we re-parameterize the other parameters using the softplus function σ h = log[exp(σ h ) + 1] , α h = log[exp(α h ) + 1] for h = 1, . . . , k, and and the softmax function (π 1 , . . . , π k ) = softmax(π 1 , . . . ,π k ) (as defined above except with t = 1), whereσ h ,α h andπ h are all unconstrained parameters in R amenable to the off-the-shelf continuous HMC algorithm.
At each state (β, v), a new proposal (β * , v * ) is generated by simulating Hamiltonian dynamics satisfying the Hamilton's equations: Since the exact solution to the above is intractable, we can numerically approximate the temporal evolution using the leapfrog scheme, as described in the following pseudocode. for Set β ← β * ; Algorithm 1: The pseudocode of the No-U-Turn Hamiltonian Monte Carlo sampler for the Bayesian distance clustering.
To accelerate the convergence of the Markov chain to stationarity, we first use the BFGS optimization algorithm (implemented in the PyTorch package) to first minimize U (β) and obtain the posterior modeβ. We then initialize the Markov chain at β =β.
A typical choice for the working parameter M −1 is to let it roughly scale with the covariance matrix of the posterior distribution (Neal, 2011). Usingβ, we calculate the observed Fisher information atβ [the Hessian matrix of U (β) evaluated atβ, denoted by Hess U (β)], which gives an approximation to the inverse covariance of β. Although it is tempting to set M −1 = [Hess U (β)] −1 , the matrix inversion of the latter is often costly and ill-conditioned. To avoid this problem, we use a simpler and diagonal parameterization M −1 = diag(1/Hess U (β) i,i ), which shows good empirical performances in all the examples within this article.
To run the HMC sampler, we use the No-U-Turn Sampler (NUTS-HMC) algorithm (Hoffman and Gelman, 2014) implemented in the 'hamiltorch' package (Cobb and Jalaian, 2020), which also automatically tunes the other two working parameters and L. After the automatic tuning, the algorithm reaches an acceptance rate close to 70% as commonly desired for good mixing of the Markov chains. To provide some running time, using a quad-core i7 CPU, at n = 1000, the HMC algorithm takes about 20 minutes for running 10, 000 iterations.
Remark 11 On the computational cost, the most expensive step in the HMC algorithm is the calculation of the derivative of log G(D; W ) with respect to the matrix W , which involves the following form: where X ∈ R n×k , symmetric B ∈ R n×n and symmetric A ∈ R n×n . Since k is relatively small, the matrix inversion of the k×k matrix is not costly [O(k 3 )] and dominated by the matrix multiplication O(n 2 k). Therefore, running over L leapfrog steps, the computational cost per iteration of HMC is O (Ln 2 k).
Potentially, one could instead consider a Gibbs sampling algorithm, using a block-wise update of C T (log D)C and C T DC (instead of a full evaluation of the matrix product) when sampling each row of C. Despite having a similar computing complexity, a strength of HMC is that we can take advantage of the highly parallelized matrix operation on C, which is often faster than the sequential looping over each row of C.
In comparison, the parametric/model-based clustering algorithm has a lower cost of O(n), although this often comes with a risk of model misspecification for modern data. Therefore, choosing which class of methods involves a trade-off between computational speed versus model robustness.
The posterior samples of CC T give an estimate of the pairwise co-assignment probabilities pr(c i = c j ) = k h=1 pr(c i = c j = h). To obtain estimates for pr(c i = h), we use symmetric simplex matrix factorization (Duan, 2020) on {pr(c i = c j )} i,j to obtain an n × k matrix corresponding to {pr(c i = h)} i,h . For the diagnostics on the convergence, we calculate the autocorrelation (ACF) and the effective sample size (ESS) for each parameter, and we provide some diagnostic plots in the appendix.
In this article, for the ease of visualization and interpretation, we use pr(c i =ĉ i | D) as a measure of the uncertainty on the point estimateĉ i = max h=1,...,k pr(c i = h | D). An alternative is to use the variation of information (Wade and Ghahramani, 2018) as a metric between the clusterings, leading to the discrete extension of the posterior mean and credible intervals. The readers can find the method and toolbox within the reference.
In addition, in the appendix, we show that the non-negative matrix factorization (NMF) algorithm, if using a calibrated similarity matrix as its input (such as using our distance likelihood), produces an almost indistinguishable result from the Bayesian distance clustering method. On the other hand, if the similarity is set less carefully (such as using the "default" choice in popular machine learning packages), we found a severe sensitivity that leads to over-/under-estimation of the uncertainty (as shown in Panel 4 of Figure 8). Therefore, for the sake of both conciseness and fairness, we choose to not present the NMF results without calibration in the numerical experiments.    (ARI) is computed for the point estimates using variation of information. The average and 95% confidence interval are shown.
As described in Section 2.1, the vector norm-based distance is automatically robust to skewness. To illustrate, we generate n = 200 data from a two-component mixture of skewed Gaussians: where SN(µ, σ, α) has density π(y | µ, σ, α) = 2f {(y − µ)/σ}F {α(y − µ)/σ} with f and F the density and cumulative distribution functions for the standard Gaussian distribution. We start with p = 1 and assess the performance of the Bayesian distance clustering model under both non-skewed (α 1 = α 2 = 0, µ 1 = 0, µ 2 = 3) and skewed distributions (α 1 = 8, α 2 = 10, µ 1 = 0, µ 2 = 2). The results are compared against the mixture of Gaussians as implemented in the Mclust package. Figure 3(a,c) show that for non-skewed Gaussians, the proposed approach produces clustering probabilities close to their oracle probabilities, obtained using knowledge of the true kernels that generated the data. When the true kernels are skewed Gaussians, Figure 3(b,d) shows that the mixture of Gaussians gives inaccurate estimates of the clustering probability, whereas Bayesian distance clustering remains similar to the oracle.
To evaluate the accuracy of the point estimateĉ i , we compute the adjusted Rand index (Rand, 1971) with respect to the true labels. We test under different p ∈ {1, 5, 10, 30}, and repeat each experiment 30 times. The results are compared against model-based clustering using symmetric and skewed Gaussians kernels, using independent variance structure. As shown in Table 1, the misspecified symmetric model deteriorates quickly as p increases. In contrast, Bayesian distance clustering maintains high clustering accuracy.

Clustering high dimensional data with subspace distance
For high-dimensional clustering, it is often useful to impose the additional assumption that each cluster lives near a different low-dimensional manifold. Clustering data based on these manifolds is known as subspace clustering. We exploit the sparse subspace embedding algorithm proposed by Vidal (2011) to learn pairwise subspace distances. Briefly speaking, since the data in the same cluster are alike, each data point can be approximated as a linear combination of several other data points in the same subspace; hence a sparse locally linear embedding can be used to estimate an n × n coefficient matrixŴ througĥ  where the sparsity ofŴ ensures only the data in the same linear subspace can have non-zero embedding coefficients. Afterward, we can define a subspace distance matrix as where we follow Vidal (2011) to normalize each row by its absolute maximum. We then use this distance matrix in our Bayesian distance clustering method.
To assess the performance, we use the MNIST data of hand-written digits of 0 − 9, with each image having p = 28 × 28 pixels. In each experiment, we take n = 10, 000 random samples to fit the clustering models, among which each digit has approximately 1000 samples, and we repeat experiments 10 times. For comparison, we also run the near low-rank mixture model in HDclassif package (Bergé et al., 2012) and spectral clustering based on the p-dimensional vector norm. Our method using subspace distances shows clearly higher accuracy, as shown in Table 2.

Clustering brain regions
We carry out a data application to segment the mouse brain according to the gene expression obtained from the Allen Mouse Brain Atlas dataset (Lein et al., 2007). Specifically, the data are in situ hybridization gene expression, represented by expression volume over spatial voxels. Each voxel is a (200µm) 3 cube. We take the mid-coronal section of 41 × 58 voxels. Excluding the empty ones outside the brain, they have a sample size of n = 1781. For each voxel, there are records of expression volume over 3241 different genes. To avoid the curse of dimensionality for distances, we extract the first p = 30 principal components and use them as the source data.
Since gene expression is closely related to the functionality of the brain, we will use the clusters to represent the functional partitioning, and compare them in an unsupervised manner with known anatomical regions. The voxels belong to 12 macroscopic anatomical regions (Table 4).   (a) Anatomical structure labels.
(c) Point estimate from Bayesian Distance Clustering. Figure 4: Clustering mouse brain using gene expression: visualizing the clustering result on the first two principal components.
For clustering, we use an over-fitted mixture with k = 20 and small Dirichlet concentration parameter α = 1/20. As shown by Rousseau and Mengersen (2011), asymptotically, small α < 1 leads to the automatic emptying of small clusters; we observe such behavior here in this large sample. In the Markov chain, most iterations have 7 major clusters. Table 5 lists the voxel counts at c (n) .
Comparing the two tables, although we do not expect a perfect match between the structural and functional partitions, we do see a correlation in group sizes based on the top few groups. Indeed, visualized on the spatial grid ( Figure 5), the point estimates from Bayesian distance clustering have very high resemblance to the anatomical structure. Comparatively, the clustering result from the Gaussian mixture model is completely different.
To benchmark against other distance clustering approaches, we compute various similarity scores and list the results in Table 3. Competing methods include spectral clustering (Ng et al., 2002), DBSCAN (Ester et al., 1996) and HDClassif (Bergé et al., 2012); the first two are applied on the same dimension-reduced data as used by Bayesian distance clustering, while the last one   is applied directly on the high dimensional data. Among all the methods, the point estimates of Bayesian Distance Clustering have the highest similarity to the anatomical structure. Figure 5(d) shows the uncertainty about the point clustering estimates, in terms of the probability pr(c i =ĉ i ). Besides the area connecting neighboring regions, most of the uncertainty resides in the inner layers of the cortical plate (upper parts of the brain). As a result, the inner cortical plate can be either clustered with the outer layer or with the inner striatum region. Therefore, from a practical perspective, when segmenting the brain according to the functionality, it is more appropriate to treat the inner layers as a separate region.

Discussion
The use of a distance likelihood reduces the sensitivity to the choice of a mixture kernel, giving the ability to exploit distances for characterizing complex and structured data. While we avoid specifying the kernel, one potential weakness is that there can be sensitivity to the choice of the distance metrics. For example, the Euclidean distance tends to produce a more spherical cluster, compared to the weighted Euclidean distance (see appendix). However, our results suggest that this sensitivity is often less than that of the assumed kernel. In many settings, there is a rich literature considering how to carefully choose the distance metric to reflect structure in the data (Pandit and Gupta, 2011). In such cases, the sensitivity of clustering results to the distance can be viewed as a positive. Clustering method necessarily relies on some notion of distances between data points.
Another issue is that we give up the ability to characterize the distribution of the original data. An interesting solution is to consider a modular modeling strategy that connects the distance clustering to a post-clustering inference model while restricting the propagation of cluster information in one direction only. Related modular approaches have been shown to be much more robust than a single overarching full model (Jacob et al., 2017).
Our concentration characterization of the within-cluster distance based on the vector norm holds for any arbitrary p. On the other hand, high-dimensional clustering is a subtle topic with challenging issues: (i) not all the coordinates in R p contain discriminative information that is favorable for one particular partition; hence some alternative distances (Vidal, 2011), feature selection (Witten and Tibshirani, 2010), or multi-view clustering (Duan, 2020) may be necessary; (ii) the selection of number of clusters k becomes difficult, and it was recently discovered (Chandra et al., 2020) that the model-based framework could lead to nonsensical results of converging to either one cluster or n clusters even under a correctly specified model, as p → ∞.
One interesting extension of this work is to combine with the nearest neighbor algorithmwe can choose to ignore the large distances and instead focus on modeling the K smallest ones for each data point -this could also significantly reduce the O(n 2 ) computing and storage cost, via the sparse matrix computation. One possible model is to replace our Gamma distance density with a two-component mixture: one Gamma component concentrating near zero for modeling small distances, and one Uniform(0, max i,j d i,j ) for handling large distances. Since the uniform density is a constant that does not depend on the specific value of the distance (as long it is in the support of the uniform), it effectively eliminates the influence of large distances in clustering.

Proof of Lemma 7
Proof For a clear exposition, we omit the sub/super-script h for now and use x i = T (y i ) where ., . denotes dot product, the second equality is due to Fubini theorem and E y i x i − µ = 0.

Proof of Lemma 9
Proof Using 1 n,m n × m matrix with all elements equal 1. Since C T C = diag(n 1 , . . . , n k ), the 2 times of normalized graph cut loss can be written as tr A(1 n,k − C) C T C −1 C T = −tr AC C T C −1 C T + tr A1 n,k C T C −1 C T .
For the second term tr A1 n,k C  Table 4: Names and voxel counts in 12 macroscopic anatomical structures in the coronal section of the mouse brain. They represent the structural partitioning of the brain.
Index Voxel Count  1  626  2  373  3  176  4  113  5  79  6  39  7  12   Table 5: Group indices and voxel counts in 7 clusters found by Bayesian Distance Clustering, using the gene expression volume over the coronal section of the mouse brain. They represent the functional partitioning of the brain.

Numerical experiments showing the effect of discarding the seed
We show numerically that as n h increases, the seed conditional density K h y 1 | .) compared to G h . We repeat each experiment for 30 times and create the box plots. As can be seen from Figure 6 (left panel), the ratio quickly drops to near zero after n h ≥ 10. In addition, we repeat the same experiment except using a multivariate Gaussian N(0, I p ); and the findings are very similar (right panel).  becomes much smaller than the distance likelihood as n h increases.
The effect of distances on the shapes of clusters (a) Clustering using the Euclidean distance d i,j = y i − y j 2 , each cluster has a spherical shape.
(b) Clustering using the weighted Euclidean distance d i,j = (y i − y j ) T S(y i − y j ), with S = diag(s 1 , s 2 ), each induced cluster has an elliptical shape. Figure 7: Experiments show that the choice of distances may impact the shapes of the clusters and the uncertainty. Although, if one wants to separate clusters based on different 'shapes', a model on the cluster-specific covariances could be more useful than one on the pairwise distances.
Connection with the nonnegative matrix factorization methods Figure 8: Comparing the performance of the Bayesian distance clustering (BDC) and the nonnegative matrix factorization (NMF) using the symmetric simplex matrix factorization (Zhao et al., 2015;Duan, 2020). We apply the algorithms on the two clusters of skew Gaussian data as generated in Section 2.1, and plot the estimated co-assignment probability matrix for each method. As can be seen from Panels 2 and 3, when NMF is well calibrated in its similarity function (as we defined in (13), with parameters set to the posterior mean estimate from BDC), it produces a result almost indistinguishable from BDC -both are very close to the oracle co-assignment probability (Panel 1). On the other hand, NMF using an uncalibrated similarity exp(−d 2 i,j ) produces a result (Panel 4) very different from the oracle.